Project Description

Prologue

The coronavirus has affected all of our lives severely over the last year, whether we got sick or not, and whether we had severe symptoms or not. Therefore, when I saw a competition on Kaggle whose goal was to find a better way to diagnose COVID-19 cases, I was interested in the challenge on both the academic and personal levels. I also saw this as an opportunity to participate in a live competition and be a part of the Kaggle community.

Project Description

Overview

Here in Israel, it can be said that the Coronavirus is over. (This sentence was written three months ago, and was left here as a warning from overoptimism.) After an amazing vaccine campaign, and after over 5 million people got vaccinated in a very short period of time, the pandemic has almost completely stopped.
But all over the world, the pandemic is still spreading. As of today, there are 14M active cases worldwide and about 500K new cases daily on average. All over the world, governments are racing against the virus by running their own vaccine campaigns. But the virus is still faster, so the effort to slow down the spread of the disease is now of utmost importance.
As we all learned in the last year, one of the key tools in this context is the early and large-scale detection of infections. The main method used for infection detection now is the PCR test. But these tests have their own disadvantages - their cost makes it difficult to apply them at a large scale, and they impose a lower bound on the time to results - so it seems to be a good idea to search for additional detection tools. It's well known that COVID-19 causes shortness of breath. But this phenomenon can also serve us - if the virus influences the lungs so strongly, we can try to detect the infection by examining the lungs.

In this project, we will develop a way to use chest radiographs (CXR) for COVID-19 infection detection. This could be a fast way to detect COVID-19 infection early, and another building block in the effort to block the virus.

Competition Description From Kaggle

The following is the description of the competition from the Kaggle website:

Five times more deadly than the flu, COVID-19 causes significant morbidity and mortality. Like other pneumonias, pulmonary infection with COVID-19 results in inflammation and fluid in the lungs. COVID-19 looks very similar to other viral and bacterial pneumonias on chest radiographs, which makes it difficult to diagnose. Your computer vision model to detect and localize COVID-19 would help doctors provide a quick and confident diagnosis. As a result, patients could get the right treatment before the most severe effects of the virus take hold.

Currently, COVID-19 can be diagnosed via polymerase chain reaction to detect genetic material from the virus or chest radiograph. However, it can take a few hours and sometimes days before the molecular test results are back. By contrast, chest radiographs can be obtained in minutes. While guidelines exist to help radiologists differentiate COVID-19 from other types of infection, their assessments vary. In addition, non-radiologists could be supported with better localization of the disease, such as with a visual bounding box.

As the leading healthcare organization in their field, the Society for Imaging Informatics in Medicine (SIIM)'s mission is to advance medical imaging informatics through education, research, and innovation. SIIM has partnered with the Foundation for the Promotion of Health and Biomedical Research of Valencia Region (FISABIO), Medical Imaging Databank of the Valencia Region (BIMCV) and the Radiological Society of North America (RSNA) for this competition.

In this competition, you’ll identify and localize COVID-19 abnormalities on chest radiographs. In particular, you'll categorize the radiographs as negative for pneumonia or typical, indeterminate, or atypical for COVID-19. You and your model will work with imaging data and annotations from a group of radiologists.

If successful, you'll help radiologists diagnose the millions of COVID-19 patients more confidently and quickly. This will also enable doctors to see the extent of the disease and help them make decisions regarding treatment. Depending upon severity, affected patients may need hospitalization, admission into an intensive care unit, or supportive therapies like mechanical ventilation. As a result of better diagnosis, more patients will quickly receive the best care for their condition, which could mitigate the most severe effects of the virus.

Understanding the challenge

This challenge, as well as the dataset itself, is composed of two levels. The first is the image level, which contains the chest radiographs, and above it we have the study level, which contains the general conclusion from all the patient's radiographs.
At the study level, each study is classified by specialists as Negative for Pneumonia, or as Typical Appearance, Indeterminate Appearance, or Atypical Appearance for COVID-19. The grading system is based on this paper, which proposes a new reporting language for chest radiograph (CXR) findings related to COVID-19, as described in the following table (Table 1 in the paper):

| Radiographic Classification | CXR Findings | Suggested Reporting Language |
|---|---|---|
| Typical appearance | Multifocal bilateral, peripheral opacities; opacities with rounded morphology; lower lung-predominant distribution | "Findings typical of COVID-19 pneumonia are present. However, these can overlap with other infections, drug reactions, and other causes of acute lung injury" |
| Indeterminate appearance | Absence of typical findings AND unilateral, central or upper lung predominant distribution | "Findings indeterminate for COVID-19 pneumonia and which can occur with a variety of infections and noninfectious conditions" |
| Atypical appearance | Pneumothorax or pleural effusion; pulmonary edema; lobar consolidation; solitary lung nodule or mass; diffuse tiny nodules; cavity | "Findings atypical or uncommonly reported for COVID-19 pneumonia. Consider alternative diagnoses" |
| Negative for pneumonia | No lung opacities | "No findings of pneumonia. However, chest radiographic findings can be absent early in the course of COVID-19 pneumonia" |

Although these findings refer to the CXRs themselves, in this challenge we are provided with these labels only at the study level, while each study can contain many images. At the image level, each image has a list of bounding boxes of findings. The bounding boxes can contain findings of different types, as described by the competition hosts:

Bounding boxes were placed on lung opacities, whether typical or indeterminate. Bounding boxes were also placed on some atypical findings including solitary lobar consolidation, nodules/masses, and cavities. Bounding boxes were not placed on pleural effusions, or pneumothoraces. No bounding boxes were placed for the negative for pneumonia category.

The dataset doesn't distinguish between the finding types. All findings were given the label opacity, and the predicted class for findings in the submission should always be opacity.
The details of the study grading method, according to the findings in the images, are described in the table above. Even though the exact meaning of the terminology is definitely beyond my understanding, one thing we can learn from this table is that the classification is based on the nature of the findings, as well as on their region in the lungs. This is crucial for a better understanding of what our model is supposed to learn.

Notebook initialize

Basic Imports

Environment Settings

To run this notebook I switched between Kaggle kernels and Google Colab. The main advantage of the Kaggle kernels is that the competition data comes built in, and the disk is relatively fast. On the other hand, Google Colab is a much more convenient environment, not to mention its free GPU, while Kaggle kernels are limited to 30 GPU hours per week. Downloading the dataset from Colab and saving it on a mounted drive didn't work - the data was too big and crashed the kernel. To be able to work on Google Colab, I downloaded the whole competition dataset on a local machine with a high upload rate and uploaded it to Google Drive. Then I could mount my drive on the Colab kernel and get access to the data - but the performance that way was much worse than on Kaggle. The code below is used to switch between the environments.
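The switching logic can be sketched roughly as below; the flag check and the exact paths are illustrative assumptions, not the notebook's actual values:

```python
import os

def get_data_path():
    """Return the dataset root depending on the runtime environment."""
    if os.path.exists('/kaggle/input'):
        # Kaggle kernel: the competition data is mounted read-only here.
        return '/kaggle/input/siim-covid19-detection'
    # Google Colab: the data was uploaded to Drive and mounted beforehand.
    return '/content/drive/MyDrive/siim-covid19-detection'
```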

EDA

The dataset is composed of three parts: the CXR files in DICOM format, and two metadata tables - one for the image level and another for the study level. Let's first explore the image-level metadata of the training set.

Before doing anything else, we'd like to change this terrible column name StudyInstanceUID to a more reasonable one.
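A minimal sketch of the rename with pandas; the target name `study_id` and the toy frame are assumptions standing in for the real image-level CSV:

```python
import pandas as pd

# Toy frame standing in for the image-level metadata read from the CSV.
train_image = pd.DataFrame({'id': ['img_0'], 'StudyInstanceUID': ['study_0']})

# Replace the unwieldy column name with a shorter one.
train_image = train_image.rename(columns={'StudyInstanceUID': 'study_id'})
```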

Now it's much better.

For each image we are provided with an image id, a study id, the findings' bounding boxes, and a label for each bounding box. Let's examine the label column first. The content of the label column corresponds to the submission's desired format. It contains descriptions of an unlimited number of findings, separated by whitespace. Each description contains 6 fields, also separated by whitespace, as follows:

finding_label confidence xmin ymin xmax ymax

This pattern is repeated as many times as the image has findings. So if we have $k$ findings for a specific image, the label row will be: finding_label_1 confidence_1 xmin_1 ymin_1 xmax_1 ymax_1 finding_label_2 ... finding_label_k confidence_k xmin_k ymin_k xmax_k ymax_k

Let's extract these values.
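The extraction can be sketched with a small helper; `parse_label` is a hypothetical name, not a function from the dataset:

```python
def parse_label(label):
    """Split a label string into per-finding tuples of
    (finding_label, confidence, xmin, ymin, xmax, ymax)."""
    tokens = label.split()
    findings = []
    for i in range(0, len(tokens), 6):
        name, conf, *coords = tokens[i:i + 6]
        findings.append((name, float(conf), *map(float, coords)))
    return findings

# Example with two findings:
parse_label('opacity 1 10 20 110 220 opacity 1 30 40 130 240')
```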

Now we can see the domains of these values

The labels are only none and opacity, and the confidence in the training set is always 1 (since this is a labeled dataset). All bounding box data is provided in the boxes field; this field is NaN when there are no findings (as we can see in the second row of the head of the DataFrame printed above). So in fact, all the data we need exists in the boxes field.
Thus, we can extract the findings data directly from the boxes field and examine some of its properties.

Before further exploration of findings properties, let's explore the study-level metadata:

In this dataframe each study is classified into one of 4 classes: Negative for Pneumonia, Typical Appearance, Indeterminate Appearance, and Atypical Appearance. It is important to know how these classes are distributed over the dataset.
In the evaluation section of the competition details on Kaggle it is said that

Studies in the test set may contain more than one label. They are as follows: negative, typical, indeterminate, atypical

Accordingly, this is a multilabel classification task.
In contrast, in a post in the competition discussion section, the hosts indicated that

Per the grading schema, chest radiographs are classified into one of four categories, which are mutually exclusive

Since the two descriptions contradict each other, it is worth inspecting the training set to see which labels can appear together on a study.

So in the training set, each study has a single label attached to it, and this classification is in fact a one-hot encoding of each study into one of those 4 classes.

Since the result of our check on the training set supports the second post, and since, inherently, by their meanings, the labels seem to be mutually exclusive, we will treat this as a single-label classification task. For convenience, we will store the labels in a single column rather than in one-hot encoded format and join them with the images dataframe.
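The conversion from one-hot columns to a single label column can be sketched as follows; the column names are assumptions mirroring the study-level CSV, not exact identifiers:

```python
import pandas as pd

classes = ['Negative for Pneumonia', 'Typical Appearance',
           'Indeterminate Appearance', 'Atypical Appearance']

# Toy study-level frame in the one-hot format of the study-level file.
train_study = pd.DataFrame([['study_0', 0, 1, 0, 0],
                            ['study_1', 1, 0, 0, 0]],
                           columns=['study_id'] + classes)

# Collapse the one-hot columns into a single categorical label.
train_study['label'] = train_study[classes].idxmax(axis=1)
```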

Next we will check how these classes are distributed over the dataset.

We have 47% certain Covid cases (typical appearance), 36% non-covid cases (28% negative for pneumonia and 8% atypical for covid), and 17% obscure cases. From a covid vs non-covid point of view, the dataset is quite balanced. But from the classification point of view, almost 50% of the cases are from one class and only 8% of the cases are from another.
To better understand the distribution, let's see the class distribution in absolute numbers:

Next, let's look at some properties of the findings. The number of findings varies from image to image. Each finding is an opacity (or another of the above-mentioned finding types) in the CXR, and for each we are provided with the bounding box of the opacity area. Let's look at the number of findings per image and the main statistical properties of their areas: sum, mean, max, etc.

The findings count for each class label is:

It is clear now that all the negative cases have no findings at all, as stated in the grading method table. On the other hand, for each of the other three classes it seems that there are instances with no opacity findings, contrary to the class descriptions in the table above. But is this really the case? We saw earlier that we have more images than studies. That is, some studies have more than one image. So it seems that in some cases the prognosis is based on findings that are annotated in only one of the scans. Let's verify this conclusion.

It becomes clear that the above conclusion is correct. Almost all of the clear covid-19 cases have 2 findings, and a couple of them have 3. The indeterminate cases also have at least one finding each, and only the non-covid cases sometimes have no findings, even when positive for pneumonia. But according to our table, no findings means negative for pneumonia, so we'll put these instances aside for now.

Now all the positive cases have findings. Let's inspect other properties of the findings:

One can see that clear covid cases strongly tend to have larger findings areas. The indeterminate cases also tend to have larger findings areas than the atypical ones, but this difference is much less significant. Let's inspect these features again, but now at the study level.

It seems that there is no significant difference between the image level and the study level. But this leads us to two important questions: how many images are related to one study on average? And in cases where a study has more than one image, how many images is the prognosis based on?

In most cases there’s one image per study. But in the cases with multiple images, what is the difference between the images? Is the prognosis based on all of them?
It is straightforward to get an answer to the second question from the data. We simply count the number of images labeled with findings.

So it is established that there is never more than one image labeled with findings per study. Now it will be interesting to see the difference between the images in one study. To do that, we have to pay attention to the third and most important part of our dataset - the DICOM files.

DICOM files

The data is provided in DICOM format, which is the standard for medical imaging information and related data. This format packs each medical image together with related data, such as Patient ID, Name, Sex, etc. In our case, the data is de-identified for privacy reasons, but there may still be important information in the metadata provided in the DICOM file. Let's pick a file and see what it looks like.

Let's take a taste of our data:

Many of the images are cropped or rotated, and have different illumination levels. Lungs are contained in all of the images, but the location of the lungs within the image is not constant, the margin sizes vary, and the images may contain other body parts - neck, stomach, hands, etc. To get a better understanding of the matter at hand, it will be helpful to see CXRs from the different labels with the annotated bounding boxes drawn on the image.

The results are not clear for the non-expert eye. Although sometimes there is a kind of opacity in the boxes, in other cases there is no clear difference between the area inside and the area outside the box. This will affect the algorithm development and verification processes since I will not be able to rely on my own knowledge and intuition.

Now let's take a look at the metadata provided by the DICOMs. The attributes that may interest us are the body part examined, sex, image size, pixel spacing (representing the physical size of the image), modality (the scanning method), and image type. Let's extract these features into a pandas dataframe for later use.

Let's inspect the value ranges of our new data:

The first thing to inspect is the sex field. How is our data split between the sexes? How is sex related to the COVID-19 prognoses?

We can see here the well-known fact that, statistically, women suffer less from Covid-19. Although our data is quite balanced with respect to sex, women suffer much less from pneumonia of any kind. In the typical covid cases (clear/severe covid cases) there are only about two-thirds as many women as men.

In the Body Part Examined column, we have the unique values:

'CHEST' 'PORT CHEST' 'TORAX' nan 'T?RAX' 'Pecho' 'THORAX' 'ABDOMEN' 'SKULL' '2- TORAX' 'TÒRAX' 'PECHO'

'CHEST', 'THORAX' (which appears in many variants), and 'PECHO' all mean the same thing, whether you prefer English or Spanish. Let's try to see what else we can extract from the metadata. (Why do we have SKULLs here???)

The ABDOMEN images seem to contain the lower part of the body too. Besides that, there does not seem to be a significant difference between the image groups (in particular, we have no SKULLs here).

Now we can inspect the difference between images in the same study.

It can be seen that images in the same study are sometimes identical or almost identical; sometimes the only difference is in the image post-processing (cropping, lighting, etc.), and in a few cases the study contains an additional image from another point of view.

Evaluation

The evaluation at the study level could be quite simple: we only have to check the prediction accuracy.

But at the image level, the predictions will be bounding boxes. The predicted bounding boxes will probably not match the labeled ones exactly, and we do not care about minor differences. So how will we decide whether our predictions are consistent with the labels or not? Come to think of it, the most important thing here is how much our predicted area intersects with the ground truth labels. For an ideal prediction, the predicted area will match the ground truth exactly; that is, the intersection, the prediction area, and the ground truth area are all equal. In a more realistic case, our prediction is a bit smaller or larger than the ground truth, or spans too far in one direction and falls short in another. In all these cases, the larger the intersection area is relative to both the predicted area and the ground truth area, the more we can regard the prediction as correct. This is the rationale behind the PASCAL VOC2010 IoU (Intersection over Union) evaluation method, which is used in this competition: a bounding box prediction is considered correct if the ratio of the intersection to the union of the prediction and the ground truth is greater than $0.5$. I.e., we demand $$IoU = \frac{A_y \cap A_{\hat{y}}}{A_y \cup A_{\hat{y}}} > 0.5$$
where $A_y$ is the ground truth bounding box area and $A_{\hat{y}}$ is the area of the predicted box.
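The IoU computation can be sketched in a few lines; `iou` is an illustrative helper, with boxes given as (xmin, ymin, xmax, ymax):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

iou((0, 0, 100, 100), (0, 0, 100, 100))   # identical boxes -> 1.0
iou((0, 0, 100, 100), (50, 0, 150, 100))  # half overlap -> 1/3
```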

But deciding whether a predicted bounding box matches a target bounding box is not enough, since there is a varying number of bounding boxes in each image. So we have to find a way to evaluate how far the predicted set of bounding boxes is from the targets. Of course, we could regard each redundant bounding box as a false prediction, and then calculate the accuracy over all the predictions. But this approach has a significant drawback: the model must predict exactly the target labels, at 100% confidence. Unlike a regular classification task, where the model gives a score for each class so we can interpret not only the chosen class but also the probabilities of all the other classes, here the model cannot express different options; it has to predict only the most confident boxes. But sometimes we want a wider view. For example, we may want to minimize the false negative detections by taking into account all the bounding boxes that may contain a specific class above some confidence threshold. For this reason, VOC defines a dedicated metric called Mean Average Precision, or mAP. The main idea behind this metric is to first consider each of the classes to be detected separately, as a binary classification against all other classes, and then to calculate the average precision of this binary classification. For average precision, we sort the predictions by their confidence scores. Then, for a given recall $r$, we check which confidence threshold we would have to set in order to get this recall, and calculate the precision of the predictions above this threshold. The average precision is the average of this precision over the recall segment $[0, 1]$. The mean average precision is then calculated as the mean of the average precisions of all classes.
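The average-precision computation described above can be sketched as follows; this is an illustrative all-point-interpolated implementation in the spirit of VOC2010, where `is_positive` marks predictions that matched a ground-truth box (IoU > 0.5):

```python
def average_precision(scores, is_positive, n_positives):
    """All-point-interpolated average precision (PASCAL VOC2010 style)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / n_positives)
    # Integrate precision over recall, taking the max precision to the right
    # of each recall level (the interpolation step).
    ap, prev_recall = 0.0, 0.0
    for k in range(len(recalls)):
        ap += (recalls[k] - prev_recall) * max(precisions[k:])
        prev_recall = recalls[k]
    return ap
```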

In order to use the same evaluation method for the study level and for the image level, the competition hosts decided to use this metric for the classification task as well. This means that instead of predicting one class for each study, we have to give a score for each of the classes. The main implication is that the predictions of the different classes are independent of each other.

Training a Model

Study Level

Baseline model

In this project I chose fastai as the main library. Fastai is a framework built upon PyTorch that provides an easy and fast way to build ANN models.

As a baseline model we will take the simplest resnet, trained as recommended by fastai as a standard baseline - using the largest feasible batch size. Fastai also recommends using their LR finder to determine the learning rate, but my experiments on this dataset show that, using Adam (fastai's default optimizer) and the one-cycle policy, anything in the range $1e-3$ - $1e-2$ works well, and this is more or less the value given by the LR finder. Thus, I set the One Cycle max_lr parameter to $1e-3$ for all the following experiments.

For the baseline we want to keep the model and the training and prediction processes fast and simple in terms of computational resources, so we will resize the images to a resolution of 256$\times$256.

First, let's install the latest version of fastai.

For training a model, it's much more convenient and efficient to convert the DICOM files to JPEG. This will also allow us to work on Colab kernels instead of Kaggle kernels, which allows free GPU usage and a more convenient environment. Some kagglers have already done this and created several datasets with JPEGs at different resolutions. For the training process we will use a set of datasets created by a kaggler called Awsaf, who created three datasets at resolutions of 256, 512, and 1024.

Here I attached the Colab to my Google Drive in order to copy my Kaggle credentials and download those datasets:

Besides the imports, we have here the random_seed function I took from Kaggle, which sets the random seed for python's random functions, numpy, torch, and cuda. Using this function with a constant seed is very useful for the reproducibility of the experiments.
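A sketch of such a seeding function, adapted from the snippet that circulates on Kaggle; the torch part is guarded here so the sketch also runs where torch is not installed:

```python
import random
import numpy as np

def random_seed(seed_value, use_cuda=False):
    """Fix the random seeds of python, numpy, and (if available) torch,
    for reproducible experiments."""
    random.seed(seed_value)
    np.random.seed(seed_value)
    try:
        import torch
        torch.manual_seed(seed_value)
        if use_cuda:
            torch.cuda.manual_seed_all(seed_value)
            torch.backends.cudnn.deterministic = True
    except ImportError:
        pass

random_seed(42)
```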

From now on we will use the train.csv file that comes with the JPEG dataset, which maps each image id to its study id, boxes, and the class columns for its study from the study-level file. We will merge the DICOM_meta dataframe we created in the EDA into this table, so we will have all the necessary data in one table.

First we have to split our data into training and validation sets. Fastai provides its own splitter in its dataloaders creator, but that splitter is not stratified, which matters especially with such imbalanced data, so we will use the sklearn splitter. Another reason to use sklearn is that in some cases we have several images in a single study. These images are highly correlated and sometimes almost identical, so to prevent data leakage we want to split the data by study id, so that all the images of the same study end up in the same split.
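The idea can be sketched as follows: split the unique study ids, stratified by class, and then map the split back to the images. The toy frame and column names are assumptions for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: each row is an image; several images may share a study.
df = pd.DataFrame({
    'image_id': [f'img_{i}' for i in range(8)],
    'study_id': ['s0', 's0', 's1', 's2', 's3', 's4', 's5', 's5'],
    'label':    ['Typical', 'Typical', 'Negative', 'Typical',
                 'Negative', 'Typical', 'Negative', 'Negative'],
})

# Split the *studies*, stratified by label, so all images of one study
# land on the same side and the class balance is preserved.
studies = df.drop_duplicates('study_id')
train_ids, valid_ids = train_test_split(
    studies['study_id'], test_size=0.33, random_state=42,
    stratify=studies['label'])
df['is_valid'] = df['study_id'].isin(valid_ids)
```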

The first thing to do is to create dataloaders to load the training and validation data into the model. For now, the only transform we want to apply to the data is resizing the images to 256$\times$256. To save time and resources, we will use pre-resized images here instead of inserting the resizing into the dataloader pipeline. Since we want to use imagenet-pretrained weights, we also need to normalize the data to the imagenet stats, but fastai takes care of this automatically.

Once we have the dataloaders, we can create and train the baseline model.

We will run the LR finder here to show that it indeed finds a value in the range $(1e-3, 1e-2)$ as described above.

The model achieves about 0.6 accuracy within the first few epochs, and then stops progressing. Looking at the training loss against the validation loss in the training log and in the plot above, we can see that the model starts to overfit from the 4th-5th epoch.

Data Augmentations

To prevent the overfitting we will apply data augmentation transforms. Fastai provides the aug_transforms method, which unites a handful of image transformations such as rotations, flips, zooming in and out, and others. From the images printed above, we can see that some of the images are rotated by about 15 degrees, so we will set the maximum rotation to 15 degrees. The other defaults of the aug_transforms method seem to be appropriate for our data.

Here is the new dataloaders object, configured with the augmentation transforms:

We will now use these dataloaders to retrain the model:

The augmentations prevented the overfitting and the result improved. But at the last epoch the training loss gets very close to the validation loss, so overfitting is still only one or two epochs ahead. Below, the model is trained for 3 more epochs, so we can see it overfit.

We need stronger augmentations in order to be able to train for more epochs:

Now the transforms are much stronger. Let's try them.

Much better. Now we can continue the training.

We achieved a stable accuracy of ~.62 over the last epochs.

Let's now try to add a different type of augmentation - mixup. In mixup augmentation we mix, with some probability, two images from the batch. Of course, we need to mix their labels as well - if we took .75 of a negative image and .25 of a typical image, we need to label the mixed image as .75 negative and .25 typical. For technical reasons, in fastai this method is implemented as a callback on the Learner object.

We will also add a save-model callback to save the most accurate model during this training.

We are still far enough from overfitting, so we can continue.

It seems that we got the best we can achieve from this model.

Now we have a model we can export and use to make a submission to the competition on Kaggle. The score of the model will be lower than its accuracy, mainly because in the competition we also have to predict opacity bounding boxes at the image level, and we haven't developed that part yet. But the score can still be used to compare models.

Let's load the best model and export it.

The submission of this model got a score of 0.322 on the public leaderboard.

This score makes sense: since about 50% of the score comes from the image level, a score of 0.32 matches the 0.63 accuracy of the model.

Deeper Resnet Architectures

Let's now try deeper versions of resnet. The next resnet version in the pytorch model zoo is resnet34. Using this architecture forces us to reduce the batch size, because a batch size of 128 overflows the Colab GPU memory for this architecture.

The deeper model did worse than the simpler one, maybe because of the smaller batch size. I tried to train versions of resnet deeper than resnet34 with different learning rates, batch sizes, and numbers of epochs, but all the experiments gave even worse results, and I could not find the reason.

Although not presented here, experiments with other architectures like densenet and efficientnet gave similar results to resnet18 and resnet34. On Kaggle I found a training notebook for EfficientNet B7 on TPU with an accuracy of about 0.65, but when I tried to replicate the training on Google Colab I found that for the Colab TPU I had to halve the batch size, which dropped the accuracy back to about 0.62-0.63.

Higher Image Resolutions

We started with an image size of 256, but maybe the model can give more accurate results with a larger image size. When increasing the resolution we again have to reduce the batch size to fit the GPU memory capacity, so we will set the batch size to 64 for image size 512 and to 16 for image size 1024.

Increasing the image resolution doesn't improve the model accuracy significantly. We will stay with resnet18 and an image size of 256.

Examining the Results

Let's now examine the model results.

The accuracy is about 0.6. But we have 4 classes, so to better understand the model's performance let's look at the confusion matrix.

The model accuracy on the Typical and Negative classes is quite good, but on Indeterminate and Atypical the accuracy is very poor. The recall is very low, meaning that the model in fact cannot identify atypical and indeterminate cases at all.

Let's look at the results:

These results don't mean anything to me, since I cannot read those CXRs. Plotting the opacity bounding boxes on the X-rays may help in understanding the results.

Generally speaking, we can see that when the opacity areas are small or asymmetric, the model tends to misclassify the image.

To better understand the model's decisions, I tried to use the Grad-CAM method, in which we compute the gradient of the loss w.r.t. the last convolutional layer and upsample it to the original image size to obtain a heatmap that shows us the importance of each part of the image in the model's decision. Here are the Grad-CAM heatmaps for 5 samples from each study class, with the finding bounding boxes plotted as well. The predicted classes are plotted above the images, colored red for wrong classifications.
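The mechanics of Grad-CAM can be sketched on a toy CNN (not the trained resnet; the architecture, hooks, and shapes here are illustrative assumptions): hook the last convolutional layer, backprop the score of the predicted class, pool the gradients into per-channel weights, and upsample the weighted activation map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy CNN standing in for the trained model.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 4))

# Capture the last conv layer's activations and gradients via hooks.
acts, grads = {}, {}
last_conv = model[2]
last_conv.register_forward_hook(lambda m, i, o: acts.update(a=o))
last_conv.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 1, 64, 64)
scores = model(x)
scores[0, scores.argmax()].backward()  # backprop the predicted class score

# Channel weights = spatially pooled gradients; CAM = weighted sum of maps.
weights = grads['g'].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * acts['a']).sum(dim=1, keepdim=True))
heatmap = F.interpolate(cam, size=x.shape[-2:], mode='bilinear',
                        align_corners=False)
```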

The results are very confusing. The parts of the image that are most important to the model are not related to the finding boxes, and sometimes lie outside the lungs, or even outside the body altogether. Maybe we can do better if we can turn the model's attention to the lungs. We will return to this later.

Dataset Balancing

Given the small amount of data we have for the Atypical and Indeterminate classes, the model's poor results on these labels are not a big surprise. We can use oversampling to balance the data before the training.
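One simple oversampling scheme, sketched here with pandas (the helper name and column are assumptions): duplicate rows of the minority classes until every class is as frequent as the largest one.

```python
import pandas as pd

def oversample(df, label_col='label', random_state=42):
    """Resample every class (with replacement) up to the size of the
    largest class, so all classes end up equally frequent."""
    largest = df[label_col].value_counts().max()
    parts = [grp.sample(largest, replace=True, random_state=random_state)
             for _, grp in df.groupby(label_col)]
    return pd.concat(parts).reset_index(drop=True)
```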

The accuracy is much lower compared to the former model. Let's look at the confusion matrix:

Now the model is much more accurate on atypical cases, but it comes at the expense of the typical class. It seems that the decision between typical and atypical is too hard - when the dataset didn't contain many atypicals, the model could ignore them and be accurate on the typicals. But now that the dataset contains equal numbers of typicals and atypicals, the model accuracy drops.

Although the total accuracy of the balanced model is lower, we can still use it in the competition for scoring the atypical and indeterminate classes. Since the evaluation method calculates the scores for each class independently, we can use, for each class, the best model for that class.

Using this model to predict the atypical and indeterminate classes gives a score of 0.316 on the public LB. But if this model has better accuracy on the atypical and indeterminate classes, how did using it on these classes give worse results? The answer is that the first model was better at estimating the probability that an image is atypical than the balanced model, even though it generally gave every image a higher probability of being typical. When sorting the images by their predicted atypical probability, there are still more real atypicals at the top of the list for the regular model than for the balanced model.

The reason that the regular model is better than the balanced one must be related to the oversampling in some way. Maybe the balanced model overfitted on the duplicated atypical and indeterminate images.

Lungs Annotation

When examining the model results, we saw that the model had a hard time recognizing small or asymmetric findings. How can we help the model detect them? We already know that these findings are found only in the lung regions of the image. If we could annotate the lungs on the images and tell the model where the lungs are, instead of searching the whole image for findings, the model would pay more attention to the important part of the image, and this might improve the results.

On Kaggle, we can find this dataset, which contains CXRs and lung masks. We can use it to train a lung detector and create annotations for our dataset.

Let's download the dataset from kaggle:

Downloading and preparing the dataset:

Some masks are missing, so we have to filter out all the CXRs for which the dataset does not contain mask labels. Besides that, we will give the masks the same names as the CXR files (the original names come in multiple formats):

In the original dataset the lung-mask value is 255; i.e., the mask files contain 255 in the lung region and 0 elsewhere. Fastai expects the segmentation label for a one-class segmentation task to be 1, so we have to fix the masks.
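The fix can be sketched as below; `binarize_mask` is an illustrative helper that rewrites a 0/255 mask as a 0/1 mask:

```python
import numpy as np
from PIL import Image

def binarize_mask(src_path, dst_path):
    """Convert a 0/255 lung mask into the 0/1 codes expected for a
    single-class segmentation target."""
    mask = np.array(Image.open(src_path))
    Image.fromarray((mask > 0).astype(np.uint8)).save(dst_path)
```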

Let's show the data:

Using this dataset we will train a lung detector:

Let's examine the model's results on the main dataset:

Excellent. With this model we will create a new dataset of lung-annotated CXRs.